R for genetics/genomics?R anyhow to analyse non genetic dataRRadegenet, poppr and pegasR Markdown to produce a report of the analyses, incorporating the workflow and resultsThis document is an R Markdown document. An R Markdown document is the combination of a Markdown document with chunks of R code in the middle that are evaluated using an R package called knitr.
R Markdown document can be used interactively or they can be turned into beautiful HTML document or other (e.g. PDF, Word, …) if you click on the button “Knit” on top.
The benefits of using an R Markdown document is that you have your text, scripts, results, plots and tables in a single place.
You can edit the current R Markdown file to add your own notes to it (but save it under a different name, so that you can go back to the original version if you need to) and to try out or even modify the R code.
If you prefer to see pure R code only, just do:
#file.remove("markdown_R4G3_course.R")
knitr::purl("markdown_R4G3_course.Rmd")
Note: it will create a file called markdown_R4G3_course.R but it won’t overwrite it if it already exists. So, if you want some changes to be taken into account, make sure you delete the R file beforehand (e.g., by running the first line in the chunk above).
Just write plain text in the script. You can use different number of underscores to indicate:
For example _word_ is rendered as word.
knitr syntaxUse ```{r} to start an R chunck and ``` to close it.
For example if you type 1 + 1 within a chunk, it will be displayed as:
1 + 1
## [1] 2
Then, to evaluate the chunck, just press CTRL-R after having put your mouse cursor inside the chunck (anywhere).
You can learn more about R Markdown on this online book or if you already know a little, the R Studio cheatsheet will help you to keep it all in mind.
R packages readyWe will use the R packages adegenet, poppr and pegas for the analyses and ggplot2, lattice and viridisLite for the plots. If you have already installed before, you can simply load the packages as follow:
library(adegenet)
library(poppr)
library(pegas)
library(ggplot2)
library(lattice)
library(viridisLite)
Note 1: If one of the package is missing, then you must install it BEFORE loading them. So you may need to use, for ex:
install.packages("adegenet")
Note 2: To be able to “knit” the R Markdown, you will also need other R packages but try kniting and R Studio should do the rest for you.
We also add a few extra functions we coded for you:
source("scripts/tools.R")
Rgenepop formatWe will be using a very common type of input file for microsatellite data, which was originally developed for a stand-alone program called Genepop. You will learn how to create a genepop file with Jörns tomorrow. Today we will use one that we have made for you.
Here is the basic structure:
comment line
locus-1, locus-2, locus-3, ...
pop
Ind-1 , 100102 135135 204208 ...
Ind-2 , 100100 131139 200208 ...
Ind-3 , 102102 131139 200204 ...
Ind-4 , 000000 135139 208208 ...
...
.
Everyone comfortable with the terms locus, allele and genotype?
RIn the folder data in this R Studio project you can find the genepop file called “nancycats.gen”.
We will create an R object that contains the data in nancycats.gen using the read.genepop() function of adegenet. Here we need to provide the path of the file as "/data/nancycats.gen" using file =, and some information about how microsatellite data is encoded in the file using ncode =. (And here we also set the argument quiet = but you don’t have to, this is just to shorten this document by not displaying some messages.)
myData <- read.genepop(file = "data/nancycats.gen", ncode = 3)
##
## Converting data from a Genepop .gen file to a genind object...
##
##
## File description: nancycats dataset from adegenet (incl. mods)
##
## ...done.
Why did we use ncode = 3 ?
Now the data is read into R.
If there had been a problem with the data file (e.g. incorrectly formatted), then we would have gotten an error message. Let’s try:
badFormat <- read.genepop(file = "data/badFormatting.gen", ncode = 3, quiet = TRUE)
## Error in dimnames(x) <- dn: length of 'dimnames' [1] not equal to array extent
We “broke” this input file by including fewer locus names than data. The error message is telling us that the length of dimnames [1] (= number of locus names) does not correspond to the number of columns containing the data.
As you can see, we are lucky that R provides us a clue about where the problem is.
Another way we can generate an error is by providing wrong arguments to the function, despite having a properly formatted file:
badInput <- read.genepop(file = "data/nancycats.gen", ncode = 2, quiet = TRUE)
## Error in read.genepop(file = "data/nancycats.gen", ncode = 2, quiet = TRUE): some alleles are not encoded with 2 characters
## Check 'ncode' argument
Unsurprisingly, there will be an error if you type the path to the file incorrectly.
One way to limit such issues is to list all genepop files in the folder containing the data (to make sure that the file is where you are trying to read it):
dir(path = "data/", pattern = "*.gen")
## [1] "badFormatting.gen" "lynx.gen" "nancycats.gen"
R representations of your datagenind formatWe haven’t looked at our in data yet!
myData
## /// GENIND OBJECT /////////
##
## // 133 individuals; 9 loci; 104 alleles; size: 85.2 Kb
##
## // Basic content
## @tab: 133 x 104 matrix of allele counts
## @loc.n.all: number of alleles per locus (range: 8-16)
## @loc.fac: locus factor for the 104 columns of @tab
## @all.names: list of allele names for each locus
## @ploidy: ploidy of each individual (range: 2-2)
## @type: codom
## @call: read.genepop(file = "data/nancycats.gen", ncode = 3, quiet = TRUE)
##
## // Optional content
## @pop: population of each individual (group size range: 10-25)
Geeky note: For more details on the content of the genind object, just try:
str(myData)
or for even more geeky details:
print.AsIs(myData)
As you will see, the genind object is a particular kind of list. Technically it is called a S4 object. Instead of elements accessible with $, S4 objects have slots accessible with @ (even if the programmer behind adegenet made it possible to use $ anyway). ___
Since manipulating such objects is a little complicated, you should use the accessors directly available:
nInd(myData)
## [1] 133
nLoc(myData)
## [1] 9
nPop(myData)
## [1] 8
locNames(myData)
## [1] "fca8" "fca23" "fca43" "fca45" "fca77" "fca78" "fca90" "fca96" "fca37"
alleles(myData)
## $fca8
## [1] "137" "141" "129" "133" "131" "135" "139" "147" "123" "149" "145" "143" "127" "119" "117" "121"
##
## $fca23
## [1] "130" "136" "146" "144" "138" "132" "142" "128" "140" "150" "148"
##
## $fca43
## [1] "137" "145" "135" "133" "139" "149" "141" "147" "143"
##
## $fca45
## [1] "128" "126" "130" "120" "118" "132" "122" "116" "134"
##
## $fca77
## [1] "152" "144" "150" "156" "146" "158" "148" "160" "154" "142" "162" "132"
##
## $fca78
## [1] "142" "150" "140" "148" "146" "144" "138" "152"
##
## $fca90
## [1] "193" "199" "191" "185" "203" "197" "195" "201" "187" "205" "189" "181"
##
## $fca96
## [1] "091" "113" "101" "121" "107" "109" "117" "103" "115" "111" "119" "105"
##
## $fca37
## [1] "182" "208" "206" "192" "194" "214" "200" "220" "210" "218" "212" "186" "216" "204" "184"
indNames(myData)
## [1] "N7" "N141" "N142" "N143" "N144" "N145" "N146" "N147" "N148" "N149" "N151" "N153" "N154" "N155" "N156" "N157" "N158" "N159" "N160" "N161" "N162" "N163"
## [23] "N463" "N35" "N36" "N37" "N38" "N39" "N40" "N41" "N42" "N44" "N45" "N46" "N47" "N48" "N49" "N50" "N51" "N52" "N53" "N54" "N239" "N240"
## [45] "N250" "N268" "N450" "N468" "N80" "N81" "N82" "N83" "N84" "N85" "N86" "N87" "N187" "N188" "N189" "N190" "N191" "N192" "N43" "N92" "N94" "N95"
## [67] "N96" "N98" "N99" "N100" "N93" "N97" "N125" "N126" "N127" "N128" "N129" "N130" "N131" "N132" "N133" "N246" "N247" "N271" "N298" "N299" "N300" "N301"
## [89] "N302" "N303" "N304" "N310" "N193" "N194" "N195" "N196" "N197" "N198" "N199" "N200" "N201" "N202" "N203" "N206" "N207" "N209" "N210" "N211" "N212" "N204"
## [111] "N227" "N228" "N229" "N230" "N231" "N232" "N233" "N234" "N235" "N236" "N282" "N283" "N288" "N291" "N292" "N293" "N294" "N295" "N296" "N297" "N281" "N289"
## [133] "N290"
popNames(myData)
## [1] "N463" "N468" "N192" "N97" "N310" "N212" "N236" "N290"
Some of the accessors can also be used to redefine some information (to handle with great care):
#let's give the pops some names:
#myPops <- c("Pop1","Pop2","Pop3","Pop4","Pop5","Pop6","Pop7", "Pop8")
myPops <- paste0("Pop", 1:nPop(myData))
popNames(myData) <- myPops
popNames(myData)
## [1] "Pop1" "Pop2" "Pop3" "Pop4" "Pop5" "Pop6" "Pop7" "Pop8"
pop(myData)
## [1] Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop1 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2
## [32] Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop2 Pop3 Pop3 Pop3 Pop3 Pop3 Pop3 Pop3 Pop3 Pop3 Pop3 Pop3 Pop3 Pop3 Pop3
## [63] Pop4 Pop4 Pop4 Pop4 Pop4 Pop4 Pop4 Pop4 Pop4 Pop4 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop5 Pop6
## [94] Pop6 Pop6 Pop6 Pop6 Pop6 Pop6 Pop6 Pop6 Pop6 Pop6 Pop6 Pop6 Pop6 Pop6 Pop6 Pop6 Pop7 Pop7 Pop7 Pop7 Pop7 Pop7 Pop7 Pop7 Pop7 Pop7 Pop7 Pop8 Pop8 Pop8 Pop8
## [125] Pop8 Pop8 Pop8 Pop8 Pop8 Pop8 Pop8 Pop8 Pop8
## Levels: Pop1 Pop2 Pop3 Pop4 Pop5 Pop6 Pop7 Pop8
loci formatNot all R packages dealing with genetics/genomics use genind objects (it would be convenient if they did!). In particular, the package pegas which allows for the computation of many useful metrics relies on an alternative format called loci.
myData_loci <- genind2loci(myData) # or as.loci(myData)
myData_loci
## Allelic data frame: 133 individuals
## 9 loci
## 1 additional variable
The loci format is more simple than the geneind format (and so what you can do with it is more limited).
Geeky note: A loci object is simply a traditional data.frame (an S3 object) with an additional attribute called "locicol":
str(myData_loci)
## Classes 'loci' and 'data.frame': 133 obs. of 10 variables:
## $ population: Factor w/ 8 levels "Pop1","Pop2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ fca8 : Factor w/ 43 levels "123/123","123/143",..: 31 5 5 14 8 6 5 6 19 8 ...
## $ fca23 : Factor w/ 34 levels "128/150","128/130",..: 4 4 3 4 15 19 7 22 18 4 ...
## $ fca43 : Factor w/ 24 levels "133/133","135/135",..: 11 20 20 2 8 2 1 2 20 7 ...
## $ fca45 : Factor w/ 27 levels "118/132","116/120",..: 18 17 19 19 12 19 11 9 11 9 ...
## $ fca77 : Factor w/ 38 levels "142/142","144/144",..: 22 4 24 34 22 2 2 10 31 25 ...
## $ fca78 : Factor w/ 23 levels "138/138","140/140",..: 10 2 6 6 5 6 2 4 18 10 ...
## $ fca90 : Factor w/ 30 levels "181/185","185/185",..: 16 16 16 27 27 16 9 24 7 24 ...
## $ fca96 : Factor w/ 31 levels "091/091","091/103",..: 1 7 1 1 1 1 1 7 1 9 ...
## $ fca37 : Factor w/ 24 levels "182/182","182/192",..: 1 6 16 5 16 2 2 1 1 16 ...
## - attr(*, "locicol")= int 2 3 4 5 6 7 8 9 10
head(data.frame(myData_loci), n = 10) # first 10 rows
## population fca8 fca23 fca43 fca45 fca77 fca78 fca90 fca96 fca37
## N7 Pop1 137/141 130/136 137/145 128/128 152/152 142/150 193/199 091/091 182/182
## N141 Pop1 129/133 130/136 135/145 126/128 144/150 140/140 193/199 091/113 182/208
## N142 Pop1 129/133 130/130 135/145 128/130 152/156 142/142 193/199 091/091 208/208
## N143 Pop1 133/133 130/136 135/135 128/130 156/156 142/142 199/199 091/091 182/206
## N144 Pop1 131/135 136/136 137/137 126/130 152/152 140/142 199/199 091/091 208/208
## N145 Pop1 129/135 136/146 135/135 128/130 144/144 142/142 193/199 091/091 182/192
## N146 Pop1 129/133 130/144 133/133 126/126 144/144 140/140 191/191 091/091 182/192
## N147 Pop1 129/135 138/138 135/135 120/126 146/158 140/148 191/199 091/113 182/182
## N148 Pop1 135/135 136/144 135/145 126/126 146/156 140/150 185/191 091/091 182/182
## N149 Pop1 131/135 130/136 135/137 120/126 152/158 142/150 191/199 101/121 208/208
attr(myData_loci, "locicol") # columns that contain loci
## [1] 2 3 4 5 6 7 8 9 10
Let’s check some basic things about the data in myData.
An important consideration for analysis is the amount of missing data in our dataset. Several types of analyses don’t cope very well with missing data. In some cases this may lead to poor estimates, in other cases it may lead to samples or loci not being considered in the analysis.
We can use info_table() from poppr to find out about missing data, and also generate a convenient plot by including plot = TRUE.
info_table(myData, plot = TRUE, low = "white")
## Locus
## Population fca8 fca23 fca43 fca45 fca77 fca78 fca90 fca96 fca37 Mean
## Pop1 . . . . . . . . . .
## Pop2 . . . . . . . . . .
## Pop3 0.357 . . . . . . . . 0.040
## Pop4 . . . . . . . . . .
## Pop5 0.150 . . 0.400 . . . 0.050 . 0.067
## Pop6 0.412 . . . . . . . . 0.046
## Pop7 . . . . . . . . . .
## Pop8 . . . 1.000 . . . 0.615 . 0.179
## Total 0.113 . . 0.158 . . . 0.068 . 0.038
Geeky note: it is tricky but you can modify anything in this plot even when info_table() as no option for it:
theplot <- ggplot_build(last_plot())
theplot$data[[4]]$size <- 3
theplot$data[[4]]$angle <- 45
plot(ggplot_gtable(theplot))
We can see that there is missing data at several loci, and that this is particularly bad for our population Pop8.
As a general rule of thumb, we try to keep missing data below 5% in order to minimize the impact on analyses.
How do you think we can achieve this?
Using genotype_curve() we can check how many loci we need in order to discriminate our genotypes. This poppr function randomly samples loci without replacement and counts the number of multilocus genotypes observed.
genotype_curve(myData)
How many loci do we *need* in order to discriminate our genotypes?
Geeky note: you can get the exact numbers:
gencurv <- genotype_curve(myData)
apply(gencurv, 2, range)
## NumLoci
## 1 2 3 4 5 6 7 8
## [1,] 23 74 111 126 128 129 130 130
## [2,] 44 110 130 130 130 130 130 130
As missing data is concentrated in a few loci and populations, rather than being thinly spread throughout the dataset, we can remove the worst offenders: a locus (fca8) and a population (Pop8).
We can ‘drop’ a population from the genind object using the following code, if we know the number (ID) of the population in the list of populations.
myData[pop = -c(8)]
However, we would like to be able to remove populations by their name. For this we need a bit more code:
removePop <- c("Pop8")
myData <- myData[pop = !popNames(myData) %in% removePop]
Removing loci from a genind object works in a similar fashion. If you know the number (ID) of a locus, then you can use the short code myData[loc = -c(1)]. But again, it is better to use names.
removeLoc <- c("fca8")
myData <- myData[loc = !locNames(myData) %in% removeLoc]
Now let’s check myData and also run info_table() to see what’s changed.
myData
## /// GENIND OBJECT /////////
##
## // 120 individuals; 8 loci; 88 alleles; size: 69.1 Kb
##
## // Basic content
## @tab: 120 x 88 matrix of allele counts
## @loc.n.all: number of alleles per locus (range: 8-15)
## @loc.fac: locus factor for the 88 columns of @tab
## @all.names: list of allele names for each locus
## @ploidy: ploidy of each individual (range: 2-2)
## @type: codom
## @call: .local(x = x, i = i, j = j, loc = ..1, drop = drop)
##
## // Optional content
## @pop: population of each individual (group size range: 10-25)
info_table(myData, plot = TRUE, low = "white")
## Locus
## Population fca23 fca43 fca45 fca77 fca78 fca90 fca96 fca37 Mean
## Pop1 . . . . . . . . .
## Pop2 . . . . . . . . .
## Pop3 . . . . . . . . .
## Pop4 . . . . . . . . .
## Pop5 . . 0.4000 . . . 0.0500 . 0.0563
## Pop6 . . . . . . . . .
## Pop7 . . . . . . . . .
## Total . . 0.0667 . . . 0.0083 . 0.0094
We have significantly reduced the amount of missing data in myData.
How else could we have solved this?
How about our ability to discriminate genotypes?
genotype_curve(myData)
Have you noticed something?
Depending on how we obtain our genetic samples, we may not know if we have genotyped the same individual more than once.
For example, if we have been genotyping samples collected in a non-invasive way (e.g. hair or faeces), then we don’t really have a way of knowing if we have obtained more than one sample per individual.
There is an R package called alleleMatch that can investigate this in detail. However, it is no longer maintained. Thankfully, poppr has some basic functionality to examine this. Using mlg() we can check for the number of unique multilocus genotypes in myData.
mlg(myData)
## #############################
## # Number of Individuals: 120
## # Number of MLG: 117
## #############################
## [1] 117
We have 120 samples, but only 117 unique multilocus genotypes. This suggests that 3 samples correspond to replicates from the same individual(s). (We introduce that in the data on purpose to show you how to deal with such issues.)
Note: the mlg() function did not tell us which samples are the same. For this we can use mlg.id(). But this gives us a very long list, because it also includes the genotypes that occur only once. Check using head():
head(mlg.id(myData))
## $`1`
## [1] "N247"
##
## $`2`
## [1] "N95"
##
## $`3`
## [1] "N133"
##
## $`4`
## [1] "N84"
##
## $`5`
## [1] "N190"
##
## $`6`
## [1] "N80"
What we want to know is which are the genotypes that occur more than once?
We achieve this by looking for entries in the list with a length greater than one.
myMLG <- mlg.id(myData)
myMLG[lengths(myMLG) > 1]
## $`77`
## [1] "N268" "N468"
##
## $`80`
## [1] "N250" "N450"
##
## $`87`
## [1] "N163" "N463"
As we do not want to include replicates of samples in our data, we should now remove one of each replicated genotype. We will remove N450, N463, and N468.
myData <- myData[!indNames(myData) %in% c("N450","N463","N468")]
myData
## /// GENIND OBJECT /////////
##
## // 117 individuals; 8 loci; 88 alleles; size: 67.7 Kb
##
## // Basic content
## @tab: 117 x 88 matrix of allele counts
## @loc.n.all: number of alleles per locus (range: 8-15)
## @loc.fac: locus factor for the 88 columns of @tab
## @all.names: list of allele names for each locus
## @ploidy: ploidy of each individual (range: 2-2)
## @type: codom
## @call: .local(x = x, i = i, j = j, drop = drop)
##
## // Optional content
## @pop: population of each individual (group size range: 10-23)
Now we have 117 unique genotypes in myData.
If this were data from non-invasively collected samples, we may expect there to be more duplicate genotypes that we have not detected yet. Why?
Geeky note:
If you have to remove a lot of samples, you can do it easily by replacing the previous call by:
samplesToKeep <- unlist(lapply(myMLG, function(x) x[1]))
myData <- myData[indNames(myData) %in% samplesToKeep]
info_table() can also be used to find out about the ploidy of your data. As the results are presented for all samples, here just the first couple lines of output (achieved by using head())
head(info_table(myData, type = "ploidy"))
## Loci
## Samples fca23 fca43 fca45 fca77 fca78 fca90 fca96 fca37
## N7 2 2 2 2 2 2 2 2
## N141 2 2 2 2 2 2 2 2
## N142 2 2 2 2 2 2 2 2
## N143 2 2 2 2 2 2 2 2
## N144 2 2 2 2 2 2 2 2
## N145 2 2 2 2 2 2 2 2
table(info_table(myData, type = "ploidy"), useNA = "always") ## counts different values
##
## 2 <NA>
## 927 9
nLoc(myData) * nInd(myData) ## total
## [1] 936
How many alleles expect in a tetraploid?
Could this number be 1 for a diploid?
Another important consideration is if our loci are variable and informative. Otherwise we’re wasting time and space on keeping them! The function informloci() can be used to detect uninformative loci and remove them.
informloci(myData)
## cutoff value: 1.70940170940171 % ( 2 samples ).
## MAF : 0.01
##
## All sites polymorphic
## /// GENIND OBJECT /////////
##
## // 117 individuals; 8 loci; 88 alleles; size: 67.8 Kb
##
## // Basic content
## @tab: 117 x 88 matrix of allele counts
## @loc.n.all: number of alleles per locus (range: 8-15)
## @loc.fac: locus factor for the 88 columns of @tab
## @all.names: list of allele names for each locus
## @ploidy: ploidy of each individual (range: 2-2)
## @type: codom
## @call: .local(x = x, i = i, j = j, loc = ..1, drop = drop)
##
## // Optional content
## @pop: population of each individual (group size range: 10-23)
In our case, all loci are informative and they are kept. If we have uninformative loci, and want these to be removed, we need to run the following.
myData <- informloci(myData)
You can easily explore genetic differences using our home made little function:
#all_pairwise <- pairwise_similarity(myData)
#hist(all_pairwise$ratio)
lapply(seppop(myData), function(x) head(pairwise_similarity(x)))
## $Pop1
## id1 id2 distinct common total ratio
## 1 N141 N7 7 12 19 0.6315789
## 2 N142 N7 4 12 16 0.7500000
## 3 N142 N141 4 15 19 0.7894737
## 4 N143 N7 4 12 16 0.7500000
## 5 N143 N141 4 15 19 0.7894737
## 6 N143 N142 3 12 15 0.8000000
##
## $Pop2
## id1 id2 distinct common total ratio
## 1 N36 N35 5 13 18 0.7222222
## 2 N37 N35 6 13 19 0.6842105
## 3 N37 N36 4 14 18 0.7777778
## 4 N38 N35 11 13 24 0.5416667
## 5 N38 N36 10 14 24 0.5833333
## 6 N38 N37 8 11 19 0.5789474
##
## $Pop3
## id1 id2 distinct common total ratio
## 1 N81 N80 8 13 21 0.6190476
## 2 N82 N80 7 13 20 0.6500000
## 3 N82 N81 3 14 17 0.8235294
## 4 N83 N80 9 13 22 0.5909091
## 5 N83 N81 8 14 22 0.6363636
## 6 N83 N82 7 13 20 0.6500000
##
## $Pop4
## id1 id2 distinct common total ratio
## 1 N92 N43 13 12 25 0.4800000
## 2 N94 N43 9 12 21 0.5714286
## 3 N94 N92 9 16 25 0.6400000
## 4 N95 N43 6 12 18 0.6666667
## 5 N95 N92 8 16 24 0.6666667
## 6 N95 N94 9 12 21 0.5714286
##
## $Pop5
## id1 id2 distinct common total ratio
## 1 N126 N125 9 14 23 0.6086957
## 2 N127 N125 7 14 21 0.6666667
## 3 N127 N126 6 13 19 0.6842105
## 4 N128 N125 5 14 19 0.7368421
## 5 N128 N126 6 13 19 0.6842105
## 6 N128 N127 5 15 20 0.7500000
##
## $Pop6
## id1 id2 distinct common total ratio
## 1 N194 N193 5 14 19 0.7368421
## 2 N195 N193 6 14 20 0.7000000
## 3 N195 N194 6 12 18 0.6666667
## 4 N196 N193 5 14 19 0.7368421
## 5 N196 N194 4 12 16 0.7500000
## 6 N196 N195 7 12 19 0.6315789
##
## $Pop7
## id1 id2 distinct common total ratio
## 1 N227 N204 10 14 24 0.5833333
## 2 N228 N204 10 14 24 0.5833333
## 3 N228 N227 4 14 18 0.7777778
## 4 N229 N204 11 14 25 0.5600000
## 5 N229 N227 6 14 20 0.7000000
## 6 N229 N228 6 15 21 0.7142857
par(mfrow = c(3, 3))
lapply(seppop(myData), function(x) hist(pairwise_similarity(x)$ratio, xlim = c(0, 1)))
## $Pop1
## $breaks
## [1] 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95
##
## $counts
## [1] 9 20 57 55 41 28 12 5 3 1
##
## $density
## [1] 0.77922078 1.73160173 4.93506494 4.76190476 3.54978355 2.42424242 1.03896104 0.43290043 0.25974026 0.08658009
##
## $mids
## [1] 0.475 0.525 0.575 0.625 0.675 0.725 0.775 0.825 0.875 0.925
##
## $xname
## [1] "pairwise_similarity(x)$ratio"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
##
## $Pop2
## $breaks
## [1] 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90
##
## $counts
## [1] 14 30 65 75 49 13 5 0 2
##
## $density
## [1] 1.1067194 2.3715415 5.1383399 5.9288538 3.8735178 1.0276680 0.3952569 0.0000000 0.1581028
##
## $mids
## [1] 0.475 0.525 0.575 0.625 0.675 0.725 0.775 0.825 0.875
##
## $xname
## [1] "pairwise_similarity(x)$ratio"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
##
## $Pop3
## $breaks
## [1] 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95
##
## $counts
## [1] 9 24 24 14 11 6 2 1
##
## $density
## [1] 1.9780220 5.2747253 5.2747253 3.0769231 2.4175824 1.3186813 0.4395604 0.2197802
##
## $mids
## [1] 0.575 0.625 0.675 0.725 0.775 0.825 0.875 0.925
##
## $xname
## [1] "pairwise_similarity(x)$ratio"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
##
## $Pop4
## $breaks
## [1] 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90
##
## $counts
## [1] 3 5 12 9 12 3 0 0 1
##
## $density
## [1] 1.3333333 2.2222222 5.3333333 4.0000000 5.3333333 1.3333333 0.0000000 0.0000000 0.4444444
##
## $mids
## [1] 0.475 0.525 0.575 0.625 0.675 0.725 0.775 0.825 0.875
##
## $xname
## [1] "pairwise_similarity(x)$ratio"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
##
## $Pop5
## $breaks
## [1] 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95
##
## $counts
## [1] 2 12 40 44 46 27 11 4 2 1 1
##
## $density
## [1] 0.2105263 1.2631579 4.2105263 4.6315789 4.8421053 2.8421053 1.1578947 0.4210526 0.2105263 0.1052632 0.1052632
##
## $mids
## [1] 0.425 0.475 0.525 0.575 0.625 0.675 0.725 0.775 0.825 0.875 0.925
##
## $xname
## [1] "pairwise_similarity(x)$ratio"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
##
## $Pop6
## $breaks
## [1] 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85
##
## $counts
## [1] 13 24 42 20 22 10 4 1
##
## $density
## [1] 1.9117647 3.5294118 6.1764706 2.9411765 3.2352941 1.4705882 0.5882353 0.1470588
##
## $mids
## [1] 0.475 0.525 0.575 0.625 0.675 0.725 0.775 0.825
##
## $xname
## [1] "pairwise_similarity(x)$ratio"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
##
## $Pop7
## $breaks
## [1] 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90
##
## $counts
## [1] 1 8 15 16 9 4 1 1
##
## $density
## [1] 0.3636364 2.9090909 5.4545455 5.8181818 3.2727273 1.4545455 0.3636364 0.3636364
##
## $mids
## [1] 0.525 0.575 0.625 0.675 0.725 0.775 0.825 0.875
##
## $xname
## [1] "pairwise_similarity(x)$ratio"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
# lapply(seppop(myData), pairwise_similarity) ## for all
par(mfrow = c(1, 1))
Consider the following two populations with different genotype frequencies:
AA Aa aa Pop1 50 0 50 Pop2 25 50 25
Do the genotype frequencies of the two populations differ?
Do the allele frequencies of the two populations differ?
... what Hardy-Weinberg equilibrium (HWE) is?
Genotype frequencies are the same in males and females.
Genotypes mate at random with respect to their genotype at this particular locus (panmixia).
Meiosis is fair.
There is no new genetic material (no mutation).
There is no gene flow (no migration).
The population is of infinite size (no drift).
All matings produce the same average number of offspring (no selection on fecundity).
There are no differences among genotypes in the probability of survival (no selection on survival).
Generations do not overlap (no selection on reproductive rate).
Using frequency of alleles in the current generation …
\(p_t = f(A)\)
\(q_t = f(a)\)
… the frequencies in the next generation can be calculated:
\(f(AA) = p_t^{2}\)
\(f(Aa) = 2p_tq_t\)
\(f(aa) = q_t^{2}\)
\(p_{t+1} = f(AA) + \dfrac{f(Aa)}{2} = p_t^2 + \dfrac{2p_tq_t}{2} = p_t^2 + p_tq_t = p_t^2 + p_t(1-p_t) = p_t^2 + p_t - p_t^2 = p_t\)
\(q_{t+1} = f(aa) + \dfrac{f(Aa)}{2} = q_t^2 + \dfrac{2p_tq_t}{2} = q_t^2 + p_tq_t = q_t^2 + (1-q_t)q_t = q_t^2 + q_t - q_t^2 = q_t\)
Under HWE, the allelic frequencies remain the same over time!
Evolution is a change in allele frequencies in a population over time
When a locus in a population is in HWE, it is not evolving: allele frequencies will stay the same across generations.
If HWE assumptions are not met, evolution can happen (allele frequencies may change).
Mutation, non-random mating, gene flow, genetic drift (caused by finite population size), and natural selection violate HWE assumptions and are thus “mechanisms” by which evolution may proceed.
The HWE is thus a the null model of micro-evolution.
So is it good/bad for a locus to be in HWE?
Very broadly speaking, there are two types of population genetics analyses you can do:
those that assume loci are in HWE, and
those that do not.
It is thus important to realise if your data reject the HWE assumptions!
With hw.test() from pegas we can test for HWE across our populations:
hw.test(myData, B = 0) ## for exact test instead of asymptotic, use large value of B
## chi^2 df Pr(chi^2 >)
## fca23 153.14590 55 3.417067e-11
## fca43 169.84216 36 0.000000e+00
## fca45 46.60491 36 1.109412e-01
## fca77 194.25536 66 1.454392e-14
## fca78 248.27891 28 0.000000e+00
## fca90 256.37179 66 0.000000e+00
## fca96 339.76874 66 0.000000e+00
## fca37 298.50220 105 0.000000e+00
Here we have conducted a \(\chi^{2}\)-test based on observed and expected genotype frequencies calculated from the allele frequencies.
Would we expect loci to be in HWE if we consider all populations together as we have done?
The Wahlund effect refers to the reduction in the number of heterozygotes due to population structure.
Consider two populations:
AA Aa aa Pop1 50 0 0 Pop2 0 0 50
Here, the two populations are in HWE.
However, if we treat them as a single population, there are no heterozygotes, and this merged-population is not in HWE.
Let’s check by population. We will split myData by population using the seppop function. This creates a list of genind objects, with every entry in the list consisting of a population.
To apply the HWE test with hw.test() to every item in the list (ie every population), we use the function lapply().
lapply(seppop(myData), hw.test)
## $Pop1
## chi^2 df Pr(chi^2 >) Pr.exact
## fca23 23.22222 10 9.955382e-03 0.255
## fca43 27.20034 6 1.328128e-04 0.001
## fca45 12.21716 15 6.625229e-01 0.458
## fca77 86.16667 45 2.152672e-04 0.000
## fca78 11.50468 6 7.397613e-02 0.008
## fca90 56.75227 15 9.039990e-07 0.000
## fca96 92.07157 15 4.066747e-13 0.000
## fca37 13.17284 15 5.889498e-01 0.211
##
## $Pop2
## chi^2 df Pr(chi^2 >) Pr.exact
## fca23 30.85879 28 0.3233864330 0.091
## fca43 10.77193 15 0.7685881191 0.794
## fca45 26.58116 21 0.1851504896 0.167
## fca77 35.34014 21 0.0259007804 0.014
## fca78 21.05849 15 0.1349710053 0.025
## fca90 61.74736 28 0.0002424509 0.011
## fca96 14.23444 21 0.8593197525 0.271
## fca37 25.96444 28 0.5749990390 0.116
##
## $Pop3
## chi^2 df Pr(chi^2 >) Pr.exact
## fca23 29.793651 15 0.012687711 0.007
## fca43 3.718750 10 0.959144052 0.750
## fca45 5.685319 6 0.459347253 0.376
## fca77 5.780408 6 0.448233208 0.175
## fca78 38.502857 15 0.000760033 0.003
## fca90 7.767857 6 0.255608831 0.186
## fca96 3.982222 6 0.679082361 0.425
## fca37 4.709877 3 0.194316502 0.101
##
## $Pop4
## chi^2 df Pr(chi^2 >) Pr.exact
## fca23 10.513889 10 0.396621136 0.551
## fca43 30.000000 21 0.091988007 0.199
## fca45 11.521794 10 0.318334361 0.402
## fca77 39.400000 21 0.008789371 0.001
## fca78 34.444444 21 0.032459406 0.001
## fca90 7.155556 6 0.306700905 0.369
## fca96 12.233560 15 0.661271220 0.529
## fca37 33.511111 21 0.040848059 0.010
##
## $Pop5
## chi^2 df Pr(chi^2 >) Pr.exact
## fca23 58.52901 45 0.084923238 0.186
## fca43 26.97092 21 0.171817797 0.016
## fca45 28.01333 28 0.463741139 0.017
## fca77 50.64042 36 0.053561033 0.000
## fca78 15.01731 10 0.131431929 0.023
## fca90 60.22659 36 0.006894609 0.094
## fca96 39.55940 28 0.072335908 0.278
## fca37 30.97222 21 0.074121484 0.031
##
## $Pop6
## chi^2 df Pr(chi^2 >) Pr.exact
## fca23 49.799592 15 1.298351e-05 0.001
## fca43 31.801084 21 6.132040e-02 0.042
## fca45 7.930211 10 6.356538e-01 0.604
## fca77 27.672222 15 2.371323e-02 0.008
## fca78 28.097222 15 2.096833e-02 0.096
## fca90 30.337654 21 8.540995e-02 0.006
## fca96 58.085432 36 1.128794e-02 0.017
## fca37 0.489298 3 9.212362e-01 0.854
##
## $Pop7
## chi^2 df Pr(chi^2 >) Pr.exact
## fca23 5.228395 6 0.514871229 0.595
## fca43 4.820988 6 0.566969462 0.663
## fca45 6.302083 10 0.789277136 0.893
## fca77 7.260000 10 0.700692473 0.721
## fca78 18.740741 6 0.004624641 0.019
## fca90 18.122500 15 0.256280157 0.279
## fca96 26.510000 15 0.032991782 0.055
## fca37 12.222222 6 0.057190992 0.254
unlist(lapply(seppop(myData), nInd)) ## careful: pop size differ
## Pop1 Pop2 Pop3 Pop4 Pop5 Pop6 Pop7
## 22 23 14 10 20 17 11
# lapply(seppop(myData), function(x) hw.test(x)) ## SAME THING
Some loci, in some populations, are still not in HWE.
Why may this be?
Another important consideration for analysing genetic data is the independence of loci.
Imagine two loci that are physically very close on the same chromosome. Will alleles at these loci be segregating independently?
Do you know how this is measured?
For this reason, and many others (e.g. Fisher’s runaway), it is possible that alleles at one locus may be associated with alleles at another locus. In other words, we may observe non-random association of alleles at different loci. This is called linkage disequilibrium.
Why is this important? If two loci are not independent this will bias results.
We can look at this with the ia() function of poppr. This calculates an index of association over all loci in the genind object.
ia(myData, sample = 999)
## Ia p.Ia rbarD p.rD
## 0.032231701 0.371000000 0.004646249 0.371000000
Here we can see the distribution of expected association between alleles at different loci when there is no linkage (dark grey barplot), and the estimate for association among alleles in our total dataset (ie all loci and all pops at the same time).
It appears we do not have a significant association across all populations; i.e., no LD.
What if we look at the same test per population? To answer this question, we use seppop() and lapply() as before. But now we include ia() and for the argument sample, which defines the number of permutations used to draw the distribution of the association index under the null hypothesis, we want a large number, e.g., 999:
lapply(seppop(myData), ia, sample = 999)
## $Pop1
## Ia p.Ia rbarD p.rD
## 0.24839705 0.02600000 0.03597463 0.02500000
##
## $Pop2
## Ia p.Ia rbarD p.rD
## 0.34478904 0.00500000 0.04940941 0.00500000
##
## $Pop3
## Ia p.Ia rbarD p.rD
## 0.22960049 0.08600000 0.03294858 0.08800000
##
## $Pop4
## Ia p.Ia rbarD p.rD
## 0.24210526 0.16000000 0.03494786 0.16100000
##
## $Pop5
## Ia p.Ia rbarD p.rD
## 0.09058140 0.34800000 0.01313577 0.34700000
##
## $Pop6
## Ia p.Ia rbarD p.rD
## 0.25275944 0.04100000 0.03638827 0.04100000
##
## $Pop7
## Ia p.Ia rbarD p.rD
## 0.54708418 0.00700000 0.07877947 0.00700000
# lapply(seppop(myData), function(x) ia(x, sample = 999)) ## SAME THING
In some populations we observe significant LD.
Is this important?
If we want to know which two loci are in LD, we can use the pair.ia() function. Let’s consider Pop7 using seppop(myData)[["Pop7"]].
pair.ia(seppop(myData)[["Pop7"]])
## Ia rbarD
## fca23:fca43 0.4763 0.4800
## fca23:fca45 0.2649 0.2652
## fca23:fca77 0.4578 0.4578
## fca23:fca78 -0.1304 -0.1318
## fca23:fca90 0.1356 0.1357
## fca23:fca96 0.0305 0.0306
## fca23:fca37 -0.0434 -0.0437
## fca43:fca45 0.2775 0.2785
## fca43:fca77 0.2721 0.2745
## fca43:fca78 0.1330 0.1378
## fca43:fca90 0.2165 0.2174
## fca43:fca96 0.1113 0.1139
## fca43:fca37 -0.1151 -0.1183
## fca45:fca77 0.0020 0.0020
## fca45:fca78 0.0458 0.0466
## fca45:fca90 0.1371 0.1371
## fca45:fca96 0.0361 0.0364
## fca45:fca37 0.1287 0.1302
## fca77:fca78 -0.2098 -0.2116
## fca77:fca90 0.1019 0.1020
## fca77:fca96 0.0302 0.0303
## fca77:fca37 0.2262 0.2273
## fca78:fca90 0.1418 0.1440
## fca78:fca96 0.2988 0.2992
## fca78:fca37 -0.1809 -0.1810
## fca90:fca96 -0.0090 -0.0091
## fca90:fca37 -0.1912 -0.1932
## fca96:fca37 -0.1527 -0.1527
We can see that LD is not the same for all pairs of loci. Or can we? (Dany cannot!)
Let us change the plot so that it is easier for colour-blind people to see. For this we add some additional arguments for colour: low = "black" and high = "green".
pair.ia(myData[pop = "Pop7"], low = "black", high = "green")
## Ia rbarD
## fca23:fca43 0.4763 0.4800
## fca23:fca45 0.2649 0.2652
## fca23:fca77 0.4578 0.4578
## fca23:fca78 -0.1304 -0.1318
## fca23:fca90 0.1356 0.1357
## fca23:fca96 0.0305 0.0306
## fca23:fca37 -0.0434 -0.0437
## fca43:fca45 0.2775 0.2785
## fca43:fca77 0.2721 0.2745
## fca43:fca78 0.1330 0.1378
## fca43:fca90 0.2165 0.2174
## fca43:fca96 0.1113 0.1139
## fca43:fca37 -0.1151 -0.1183
## fca45:fca77 0.0020 0.0020
## fca45:fca78 0.0458 0.0466
## fca45:fca90 0.1371 0.1371
## fca45:fca96 0.0361 0.0364
## fca45:fca37 0.1287 0.1302
## fca77:fca78 -0.2098 -0.2116
## fca77:fca90 0.1019 0.1020
## fca77:fca96 0.0302 0.0303
## fca77:fca37 0.2262 0.2273
## fca78:fca90 0.1418 0.1440
## fca78:fca96 0.2988 0.2992
## fca78:fca37 -0.1809 -0.1810
## fca90:fca96 -0.0090 -0.0091
## fca90:fca37 -0.1912 -0.1932
## fca96:fca37 -0.1527 -0.1527
Much better ;)
lapply(seppop(myData), pair.ia, low = "black", high = "green")
## $Pop1
## Ia rbarD
## fca23:fca43 0.0337 0.0338
## fca23:fca45 0.0038 0.0039
## fca23:fca77 -0.0373 -0.0376
## fca23:fca78 0.0306 0.0308
## fca23:fca90 0.0133 0.0134
## fca23:fca96 -0.0257 -0.0264
## fca23:fca37 0.1288 0.1288
## fca43:fca45 0.1943 0.1966
## fca43:fca77 0.1966 0.2021
## fca43:fca78 0.0120 0.0120
## fca43:fca90 -0.0328 -0.0329
## fca43:fca96 0.0203 0.0205
## fca43:fca37 0.1115 0.1122
## fca45:fca77 -0.0435 -0.0436
## fca45:fca78 0.0159 0.0161
## fca45:fca90 0.1829 0.1866
## fca45:fca96 0.0047 0.0049
## fca45:fca37 0.0033 0.0033
## fca77:fca78 0.0939 0.0970
## fca77:fca90 -0.1271 -0.1321
## fca77:fca96 0.0733 0.0780
## fca77:fca37 0.1483 0.1493
## fca78:fca90 -0.0163 -0.0163
## fca78:fca96 0.0518 0.0520
## fca78:fca37 0.0110 0.0111
## fca90:fca96 0.0233 0.0233
## fca90:fca37 -0.0201 -0.0204
## fca96:fca37 -0.0019 -0.0020
##
## $Pop2
## Ia rbarD
## fca23:fca43 -0.1198 -0.1198
## fca23:fca45 0.1944 0.1946
## fca23:fca77 -0.0536 -0.0536
## fca23:fca78 -0.0027 -0.0027
## fca23:fca90 0.0944 0.0944
## fca23:fca96 0.0218 0.0218
## fca23:fca37 -0.0684 -0.0691
## fca43:fca45 -0.0920 -0.0923
## fca43:fca77 0.1240 0.1240
## fca43:fca78 0.1349 0.1351
## fca43:fca90 -0.0166 -0.0167
## fca43:fca96 0.0423 0.0423
## fca43:fca37 0.4216 0.4279
## fca45:fca77 -0.0523 -0.0523
## fca45:fca78 -0.1235 -0.1236
## fca45:fca90 0.1536 0.1538
## fca45:fca96 0.2229 0.2230
## fca45:fca37 0.1505 0.1513
## fca77:fca78 0.1277 0.1277
## fca77:fca90 -0.0306 -0.0306
## fca77:fca96 0.0420 0.0420
## fca77:fca37 0.0110 0.0111
## fca78:fca90 -0.0098 -0.0098
## fca78:fca96 -0.0907 -0.0907
## fca78:fca37 -0.0491 -0.0494
## fca90:fca96 0.1768 0.1768
## fca90:fca37 0.0809 0.0819
## fca96:fca37 0.0537 0.0541
##
## $Pop3
## Ia rbarD
## fca23:fca43 -8.7e-02 -8.7e-02
## fca23:fca45 -5.6e-02 -5.6e-02
## fca23:fca77 3.6e-02 3.6e-02
## fca23:fca78 -1.0e-01 -1.0e-01
## fca23:fca90 1.1e-01 1.1e-01
## fca23:fca96 2.4e-01 2.4e-01
## fca23:fca37 3.1e-02 3.2e-02
## fca43:fca45 6.3e-02 6.3e-02
## fca43:fca77 6.0e-02 6.0e-02
## fca43:fca78 2.7e-01 2.7e-01
## fca43:fca90 1.5e-01 1.5e-01
## fca43:fca96 -1.5e-01 -1.5e-01
## fca43:fca37 6.7e-02 6.8e-02
## fca45:fca77 -1.1e-01 -1.1e-01
## fca45:fca78 5.3e-02 5.3e-02
## fca45:fca90 2.2e-01 2.2e-01
## fca45:fca96 -1.7e-01 -1.7e-01
## fca45:fca37 3.6e-01 3.6e-01
## fca77:fca78 -6.9e-02 -6.9e-02
## fca77:fca90 -7.5e-02 -7.5e-02
## fca77:fca96 7.4e-02 7.4e-02
## fca77:fca37 -1.1e-01 -1.1e-01
## fca78:fca90 -3.3e-02 -3.4e-02
## fca78:fca96 5.0e-02 5.0e-02
## fca78:fca37 2.4e-02 2.4e-02
## fca90:fca96 2.2e-16 2.8e-16
## fca90:fca37 3.4e-02 3.5e-02
## fca96:fca37 5.5e-02 5.5e-02
##
## $Pop4
## Ia rbarD
## fca23:fca43 0.1370 0.1371
## fca23:fca45 0.4550 0.4552
## fca23:fca77 -0.2619 -0.2650
## fca23:fca78 0.0303 0.0307
## fca23:fca90 -0.0847 -0.0847
## fca23:fca96 0.0264 0.0265
## fca23:fca37 0.0043 0.0043
## fca43:fca45 -0.0060 -0.0060
## fca43:fca77 0.1261 0.1284
## fca43:fca78 0.0145 0.0145
## fca43:fca90 -0.1102 -0.1103
## fca43:fca96 0.0690 0.0691
## fca43:fca37 -0.1377 -0.1393
## fca45:fca77 -0.2335 -0.2351
## fca45:fca78 0.0106 0.0108
## fca45:fca90 -0.1227 -0.1227
## fca45:fca96 0.1650 0.1665
## fca45:fca37 0.2585 0.2593
## fca77:fca78 -0.0755 -0.0791
## fca77:fca90 0.0811 0.0818
## fca77:fca96 -0.0165 -0.0171
## fca77:fca37 -0.1349 -0.1350
## fca78:fca90 -0.0731 -0.0741
## fca78:fca96 0.0155 0.0155
## fca78:fca37 -0.0627 -0.0649
## fca90:fca96 0.1749 0.1761
## fca90:fca37 0.3110 0.3124
## fca96:fca37 0.4809 0.4921
##
## $Pop5
## Ia rbarD
## fca23:fca43 0.0727 0.0732
## fca23:fca45 -0.1368 -0.1440
## fca23:fca77 -0.0169 -0.0169
## fca23:fca78 0.0582 0.0589
## fca23:fca90 0.1552 0.1553
## fca23:fca96 0.0853 0.0866
## fca23:fca37 0.2506 0.2564
## fca43:fca45 0.0403 0.0412
## fca43:fca77 0.0763 0.0769
## fca43:fca78 0.0261 0.0261
## fca43:fca90 0.1118 0.1128
## fca43:fca96 0.0041 0.0041
## fca43:fca37 -0.0122 -0.0122
## fca45:fca77 0.0286 0.0302
## fca45:fca78 -0.3668 -0.3727
## fca45:fca90 -0.1735 -0.1837
## fca45:fca96 -0.2480 -0.2508
## fca45:fca37 -0.2701 -0.2717
## fca77:fca78 0.1718 0.1736
## fca77:fca90 0.1279 0.1279
## fca77:fca96 0.1909 0.1938
## fca77:fca37 -0.1294 -0.1324
## fca78:fca90 0.1912 0.1937
## fca78:fca96 0.2730 0.2731
## fca78:fca37 0.1758 0.1762
## fca90:fca96 -0.1090 -0.1109
## fca90:fca37 0.0945 0.0970
## fca96:fca37 0.0825 0.0826
##
## $Pop6
## Ia rbarD
## fca23:fca43 0.128 0.129
## fca23:fca45 0.049 0.050
## fca23:fca77 0.333 0.333
## fca23:fca78 0.080 0.080
## fca23:fca90 0.021 0.021
## fca23:fca96 -0.077 -0.078
## fca23:fca37 0.132 0.133
## fca43:fca45 -0.145 -0.145
## fca43:fca77 0.222 0.227
## fca43:fca78 0.186 0.187
## fca43:fca90 -0.038 -0.038
## fca43:fca96 -0.015 -0.015
## fca43:fca37 -0.124 -0.124
## fca45:fca77 -0.140 -0.143
## fca45:fca78 -0.102 -0.103
## fca45:fca90 0.153 0.153
## fca45:fca96 -0.117 -0.117
## fca45:fca37 0.048 0.048
## fca77:fca78 -0.152 -0.153
## fca77:fca90 0.015 0.015
## fca77:fca96 -0.032 -0.033
## fca77:fca37 0.131 0.133
## fca78:fca90 0.157 0.157
## fca78:fca96 0.120 0.121
## fca78:fca37 -0.069 -0.069
## fca90:fca96 -0.052 -0.053
## fca90:fca37 0.179 0.179
## fca96:fca37 -0.023 -0.023
##
## $Pop7
## Ia rbarD
## fca23:fca43 0.4763 0.4800
## fca23:fca45 0.2649 0.2652
## fca23:fca77 0.4578 0.4578
## fca23:fca78 -0.1304 -0.1318
## fca23:fca90 0.1356 0.1357
## fca23:fca96 0.0305 0.0306
## fca23:fca37 -0.0434 -0.0437
## fca43:fca45 0.2775 0.2785
## fca43:fca77 0.2721 0.2745
## fca43:fca78 0.1330 0.1378
## fca43:fca90 0.2165 0.2174
## fca43:fca96 0.1113 0.1139
## fca43:fca37 -0.1151 -0.1183
## fca45:fca77 0.0020 0.0020
## fca45:fca78 0.0458 0.0466
## fca45:fca90 0.1371 0.1371
## fca45:fca96 0.0361 0.0364
## fca45:fca37 0.1287 0.1302
## fca77:fca78 -0.2098 -0.2116
## fca77:fca90 0.1019 0.1020
## fca77:fca96 0.0302 0.0303
## fca77:fca37 0.2262 0.2273
## fca78:fca90 0.1418 0.1440
## fca78:fca96 0.2988 0.2992
## fca78:fca37 -0.1809 -0.1810
## fca90:fca96 -0.0090 -0.0091
## fca90:fca37 -0.1912 -0.1932
## fca96:fca37 -0.1527 -0.1527
There are some typical descriptive statistics you will find in most population genetics papers. These summary statistics give us an overview of some important features of populations under study.
We will focus on the basic ones describing variation within populations and describing variation between populations.
Note: many of these can be calculated over all populations as well as per population or between pairs of populations. This offers information at different scales.
If we are interested in genetic variation in natural populations we often consider heterozygosity.
High heterozygosity means a lot of genetic variability.
Low heterozygosity means little genetic variability.
Let’s consider heterozygosity in a simple system with two alleles at a locus: A and a. Let’s also assume that this population is in HWE.
Then:
\(p = f(A)\)
\(q = f(a)\)
So under HWE, we obtain the following mating table:
A a A p^2 pq a qp q^2
So \(pq + qp = 2pq\) gives the frequency of heterozygote genotypes.
In this two-allele system, heterozygosity is highest at \(p = 0.5\). Let’s visualize:
freqP <- seq(from = 0, to = 1, by = 0.01)
freqQ <- 1 - freqP
plot(2*freqP*freqQ ~ freqQ, type = "l",
ylab = "frequency of heterozygote genotypes",
xlab = "frequency of allele 'a'")
text(0, 0, "AA")
text(1, 0, "aa")
arrows(0.45, 0, 0.05, 0)
text(0.5, 0, "Aa")
arrows(0.55, 0, 0.95, 0)
As you can imagine, this becomes more complex when there are more alleles per locus!
Questions?
Now we will calculate the observed heterozygosity in our populations, as well as the expected heterozygosity if the populations are in HWE.
For this we use the summary() function from adegenet. We can extract the observed heterozygosity using $Hobs and the expected heterozygosity using $Hexp.
To make life convenient, you can extract the values for every locus, place them in a data.frame and compute the difference:
heteroz <- data.frame(Hexp = summary(myData)$Hexp, Hobs = summary(myData)$Hobs)
heteroz$diff <- heteroz$Hexp - heteroz$Hobs
heteroz
## Hexp Hobs diff
## fca23 0.8044415 0.6666667 0.13777486
## fca43 0.7754767 0.6581197 0.11735700
## fca45 0.7687484 0.7155963 0.05315209
## fca77 0.8734020 0.6153846 0.25801739
## fca78 0.7748922 0.5982906 0.17660165
## fca90 0.8141939 0.6837607 0.13043319
## fca96 0.7565770 0.5948276 0.16174941
## fca37 0.6576813 0.4957265 0.16195485
We can also visualize how observed heterozygosity is different from expected heterozygosity by subtracting one from the other.
If you remember from before, we saw that loci were not in HWE over all populations.
barplot(heteroz$diff, names.arg = rownames(heteroz),
main = "Heterozygosity: expected-observed",
xlab = "locus", ylab = "Hexp - Hobs", font.lab = 2)
Or we can use another representation, using ggplot2 for a change:
heteroz$loci <- rownames(heteroz) ## ggplot needs names stored as a column
ggplot(heteroz, aes(y = Hexp, x = Hobs)) +
geom_segment(aes(yend = Hobs, xend = Hobs), linetype = "dashed") +
geom_label(aes(label = loci)) +
geom_abline(slope = 1) +
scale_x_continuous(limits = range(c(heteroz$Hobs, heteroz$Hexp))) +
scale_y_continuous(limits = range(c(heteroz$Hobs, heteroz$Hexp))) +
labs(title = "Heterozygosity: expected vs observed") +
xlab(expression(bold("Observed heterozygosity"))) +
ylab(expression(bold("Expected heterozygosity"))) +
theme_classic()
Now let’s consider observed and expected heterozygosity by population. Here, we will focus on the mean across loci.
adegenet conveniently provides us with a funtion to calculate this for expected heterozygosity with Hs().
Hs(myData)
## Pop1 Pop2 Pop3 Pop4 Pop5 Pop6 Pop7
## 0.6747159 0.7258979 0.6246811 0.7206250 0.7556682 0.7480536 0.6854339
To obtain observed heterozygosity there is no short cut (yet), so we need to use lapply() and seppop() again. Here we will calculate the mean of $Hobs per population in myData. As the output of seppop is a list, we also use unlist() to get a simple vector.
Hobs <- lapply(seppop(myData), function(x) mean(summary(x)$Hobs))
Hobs <- unlist(Hobs)
Hobs
## Pop1 Pop2 Pop3 Pop4 Pop5 Pop6 Pop7
## 0.5511364 0.6413043 0.5446429 0.6250000 0.6573465 0.6691176 0.7500000
Again, we can store the results for observed and expected heterozygosity into a data.frame and compute the difference:
heteroz_per_pop <- data.frame(Hexp = Hs(myData), Hobs = Hobs)
heteroz_per_pop$diff <- heteroz_per_pop$Hexp - heteroz_per_pop$Hobs
heteroz_per_pop
## Hexp Hobs diff
## Pop1 0.6747159 0.5511364 0.12357955
## Pop2 0.7258979 0.6413043 0.08459357
## Pop3 0.6246811 0.5446429 0.08003827
## Pop4 0.7206250 0.6250000 0.09562500
## Pop5 0.7556682 0.6573465 0.09832170
## Pop6 0.7480536 0.6691176 0.07893599
## Pop7 0.6854339 0.7500000 -0.06456612
We have already tested if these are significantly different above.
Geeky note:
To look at both the effect of the locus and the population at once, you can write some custom code:
heteroz_matrix <- do.call("cbind", lapply(seppop(myData),
function(x) summary(x)$Hexp - summary(x)$Hobs))
heteroz_matrix
## Pop1 Pop2 Pop3 Pop4 Pop5 Pop6 Pop7
## fca23 0.070247934 0.11720227 0.211734694 -0.170 0.04750000 0.190311419 -0.21074380
## fca43 0.094008264 -0.03402647 0.012755102 0.050 0.13875000 0.211072664 -0.22727273
## fca45 0.051652893 -0.03875236 -0.005102041 -0.040 0.14930556 -0.138408304 -0.11157025
## fca77 0.253099174 0.18336484 0.165816327 0.290 0.26250000 -0.126297578 -0.10743802
## fca78 0.237603306 0.16446125 0.257653061 0.320 0.03875000 0.008650519 -0.21074380
## fca90 0.001033058 0.11531191 -0.150510204 -0.035 0.06375000 0.262975779 0.12809917
## fca96 0.215909091 0.10396975 0.058673469 0.050 -0.06648199 0.186851211 0.17768595
## fca37 0.065082645 0.06521739 0.089285714 0.300 0.15250000 0.036332180 0.04545455
This can be easily plotted with lattice:
levelplot(heteroz_matrix, col.regions = viridis(100),
main = "Heterozygosity: expected-observed",
xlab = "Locus", ylab = "Population")
___
An effect of population subdivision on genetic variation is the reduction in observed heterozygotes. As we have seen previously (Wahlund effect).
The extent of reduction in observed heterozygotes can be used to quantify the level of differentiation between subpopulations.
This quantification is formalised in a series of hierarchical F-statistics.
–> We are using a measure of departure from HWE to quantify the extent of differentiation between populations.
F-statistics also provide a way to quantify the level of heterozygosity at the individual level, and if/how this departs from expectations of HWE in the population.
How are these linked?
F-statistics, from https://en.wikipedia.org/wiki/F-statistics
We will focus on the two most common F-statistics: \(F_{IS}\) and \(F_{ST}\).
Let’s calculate \(F_{ST}\) for our previous example.
AA Aa aa Pop1 50 0 0 Pop2 0 0 50
Here, \(p = 0.5\) and \(q = 0.5\).
Our expected heterozygosity for the total population (ie. Pop1 and Pop2 together) is given by
\(H_T = 2pq = 2 \times 0.5 \times 0.5 = 0.5\)
As we have no variation in the subpopulations (ie. Pop1 or Pop2) \(H_S = 0\) because \(2pq = 2 \times 1 \times 0\)
Then
\(F_{ST} = (0.5 - 0) / 0.5 = 1.0\)
Note: this has been a simple exploration of how F-statistics are calculated. This has been expanded upon, and not all software or R packages calculate F-stats in the same way. For example, some account for factors such as how individuals disperse (island model vs stepping-stone model), the mutation process (infinite alleles model vs step-wise mutation model) and other bias adjusters (e.g. taking into account sampling bias). It is thus important to understand and report which method you use.
To cite a package:
citation(package = "pegas")
##
## To cite pegas in a publication use:
##
## Paradis E. 2010. pegas: an R package for population genetics with an integrated-modular approach. Bioinformatics 26: 419-420.
##
## A BibTeX entry for LaTeX users is
##
## @Article{,
## title = {pegas: an {R} package for population genetics with an integrated--modular approach},
## author = {E. Paradis},
## journal = {Bioinformatics},
## year = {2010},
## volume = {26},
## pages = {419-420},
## }
##
## As pegas is evolving quickly, you may want to cite also its version number (found with 'library(help = pegas)').
packageVersion("pegas") ## don't forget to cite the version number!
## [1] '0.11'
Now let’s get back to myData.
We can use the Fst() function of the pegas package to calculate F-stats by locus, over all populations. But it requires the loci format, so we need to use as.loci() to transform our data.
Fst(as.loci(myData))
## Fit Fst Fis
## fca23 0.18478245 0.07688651 0.116882646
## fca43 0.16430065 0.07043953 0.100973655
## fca45 0.08454681 0.07840534 0.006663954
## fca77 0.31145915 0.11736447 0.219903538
## fca78 0.23714930 0.04571881 0.200601766
## fca90 0.17238071 0.06492421 0.114917430
## fca96 0.22382373 0.05128694 0.181864038
## fca37 0.25929920 0.08156729 0.193516523
How would you interpret these values?
We can also build a data.frame with all the information we need:
data.frame(Hobs = summary(myData)$Hobs, Hexp = summary(myData)$Hexp, Fst(as.loci(myData))[, c("Fst", "Fis")])
## Hobs Hexp Fst Fis
## fca23 0.6666667 0.8044415 0.07688651 0.116882646
## fca43 0.6581197 0.7754767 0.07043953 0.100973655
## fca45 0.7155963 0.7687484 0.07840534 0.006663954
## fca77 0.6153846 0.8734020 0.11736447 0.219903538
## fca78 0.5982906 0.7748922 0.04571881 0.200601766
## fca90 0.6837607 0.8141939 0.06492421 0.114917430
## fca96 0.5948276 0.7565770 0.05128694 0.181864038
## fca37 0.4957265 0.6576813 0.08156729 0.193516523
Often we are also interested in specific \(F_{ST}\) values between pairs of populations. Genetic differentiation among all populations does not need to be equal.
Let’s look at \(F_{ST}\) between pops 1 and 2. We will extract the first two populations from myData and turn this into loci format using as.loci().
myData_pop12 <- as.loci(myData[pop = c("Pop1", "Pop2")])
Now we estimate \(F_{ST}\) over all loci, obtained by using mean(), and constrain ourselves to the column of the output with the \(F_{ST}\) values:
mean(Fst(myData_pop12)[, "Fst"])
## [1] 0.1031605
Now we have a measure of differentiation between Pop1 and Pop2.
This pairwise \(F_{ST}\) value is the mean of \(F_{ST}\) values of all loci. We can check \(F_{ST}\) for every locus using Fst(myData_pop12)[, "Fst"].
As any estimate, the Fst is measured with some uncertainty. To represent this uncertainty, we can compute the 95\(\%\) confidence interval of the mean Fst value using a “very simple” bootstrap (more advanced technics are possible):
res <- replicate(100, {
myData_pop12 <- myData_pop12[, c(1, sample(attr(myData_pop12, "loci"), replace = TRUE))]
mean(Fst(myData_pop12)[, "Fst"])
})
quantile(res, c(0.025, 0.975))
## 2.5% 97.5%
## 0.08340952 0.12319740
hist(res, xlim = c(0, 0.2)); abline(v = 0, col = 2, lwd = 2, lty = 2)
Geeky note:
For now, there is no simple solution to compute pairwise Fst in R, but here is a solution using a home made function we prepared for you:
pairwise_F(myData, confint = FALSE) ## without Confidence Interval
## Pop1 Pop2 Pop3 Pop4 Pop5 Pop6 Pop7
## Pop1 NA 0.10316054 0.11176328 0.07823261 0.08678415 0.11003071 0.08819144
## Pop2 0.10316054 NA 0.07680306 0.03648835 0.02176617 0.04722565 0.04794162
## Pop3 0.11176328 0.07680306 NA 0.05316041 0.08268424 0.11785210 0.11157795
## Pop4 0.07823261 0.03648835 0.05316041 NA 0.04222016 0.04885717 0.04768922
## Pop5 0.08678415 0.02176617 0.08268424 0.04222016 NA 0.04532272 0.04676997
## Pop6 0.11003071 0.04722565 0.11785210 0.04885717 0.04532272 NA 0.05978869
## Pop7 0.08819144 0.04794162 0.11157795 0.04768922 0.04676997 0.05978869 NA
pairwise_Fst <- pairwise_F(myData, confint = TRUE) ## with CI
lapply(pairwise_Fst, round, digit = 2) ## output after rounding with 2 digits
## $mean
## Pop1 Pop2 Pop3 Pop4 Pop5 Pop6 Pop7
## Pop1 NA 0.10 0.11 0.08 0.09 0.11 0.09
## Pop2 0.10 NA 0.08 0.04 0.02 0.05 0.05
## Pop3 0.11 0.08 NA 0.05 0.08 0.12 0.11
## Pop4 0.08 0.04 0.05 NA 0.04 0.05 0.05
## Pop5 0.09 0.02 0.08 0.04 NA 0.05 0.05
## Pop6 0.11 0.05 0.12 0.05 0.05 NA 0.06
## Pop7 0.09 0.05 0.11 0.05 0.05 0.06 NA
##
## $upr
## Pop1 Pop2 Pop3 Pop4 Pop5 Pop6 Pop7
## Pop1 NA 0.12 0.15 0.10 0.10 0.13 0.11
## Pop2 0.12 NA 0.11 0.06 0.04 0.07 0.07
## Pop3 0.15 0.11 NA 0.08 0.13 0.14 0.15
## Pop4 0.10 0.06 0.08 NA 0.07 0.08 0.07
## Pop5 0.10 0.04 0.13 0.07 NA 0.06 0.09
## Pop6 0.13 0.07 0.14 0.08 0.06 NA 0.09
## Pop7 0.11 0.07 0.15 0.07 0.09 0.09 NA
##
## $lwr
## Pop1 Pop2 Pop3 Pop4 Pop5 Pop6 Pop7
## Pop1 NA 0.08 0.07 0.06 0.06 0.08 0.07
## Pop2 0.08 NA 0.04 0.01 0.00 0.01 0.02
## Pop3 0.07 0.04 NA 0.02 0.04 0.09 0.07
## Pop4 0.06 0.01 0.02 NA 0.01 0.01 0.03
## Pop5 0.06 0.00 0.04 0.01 NA 0.03 0.01
## Pop6 0.08 0.01 0.09 0.01 0.03 NA 0.02
## Pop7 0.07 0.02 0.07 0.03 0.01 0.02 NA
levelplot(pairwise_Fst$mean, col.regions = rev(grey.colors(30)))
It is worth checking if your populations have alleles that cannot be found in other populations. These are called “private alleles.”
We can use the private_alleles() function of poppr to get information about which alleles can only be found in a population.
Alleles are labelled as locus.allele.
private_alleles(myData)
## fca23.148 fca43.143 fca77.132 fca78.138 fca90.205 fca90.189 fca90.181 fca96.101 fca96.121 fca37.192 fca37.194 fca37.200 fca37.218 fca37.212 fca37.186
## Pop1 0 0 0 0 0 0 0 1 1 3 2 0 0 0 0
## Pop2 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0
## Pop3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Pop4 0 1 0 2 0 0 0 0 0 0 0 0 0 2 1
## Pop5 2 0 1 0 1 1 0 0 0 0 0 0 0 0 0
## Pop6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Pop7 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0
apply(private_alleles(myData), 1, sum) ## number of private alleles per pop
## Pop1 Pop2 Pop3 Pop4 Pop5 Pop6 Pop7
## 7 2 0 6 5 0 2
Private alleles will impact measures of differentiation between populations. Especially, if their frequency is high (i.e., if they are common).
plot(summary(myData)$pop.n.all ~ summary(myData)$n.by.pop, xlab = "Pop size", ylab = "Number of alleles", las = 1)
___
Individual-based analyses frequently do not make assumptions about HWE, because they look at the differences between individuals.
But consider how this kind of information must be presented.
individual x individual
It is not surprising that we have come up with ways to visualize this kind of information.
Two very common ways are trees (or networks) and PCA.
One benefit of displaying information in this manner is that we can see if there are groupings of individuals that are more similar to eachother than they are to other individuals.
adegenet and poppr offer us the ability to generate other distance measures. Most of the ones offered in poppr work for individuals (rather than populations).
Note: several of these distance measures make strong assumptions about the biological nature of our genetic samples.
Let’s try another measure that makes no assumptions: Prevosti’s distance, which we can use with prevosti.dist()
mynj_prevosti <- nj(prevosti.dist(seppop(myData)[["Pop1"]]))
plot(mynj_prevosti, type = "unrooted")
Note: we did not need to use 1 - in this case, because the function prevosti.dist() directly provide a distance.
Let us compare the two trees we just made to see if they match:
cophyloplot(mynj, mynj_prevosti,
assoc = cbind(mynj$tip.label, mynj_prevosti$tip.label))
Are you happy with that?
Now that we know how to make a NJ tree for a small dataset, we can do it for all samples in myData.
bignj <- nj(prevosti.dist(myData))
plot(bignj, type = "unrooted")
Geeky note: we can improve this plot by plotting colours and changing the type of tree. Here a possibility would be a “fan” plot:
plot(bignj, type = "fan", show.tip.label = FALSE, x.lim = c(-0.7, 0.7), no.margin = TRUE)
tiplabels(text = rownames(myData@tab),
frame = "none",
col = rainbow(nPop(myData))[as.numeric(myData@pop)], cex = 0.8, offset = 0.05)
legend("topleft", fill = rainbow(nPop(myData)),
legend = popNames(myData), bty = "n",
title = "Population")
Or perhaps a “radial” plot:
plot(bignj, type = "radial", show.tip.label = FALSE)
tiplabels(text = rownames(myData@tab),
frame = "none",
col = rainbow(nPop(myData))[as.numeric(myData@pop)], cex = 0.8, offset = 0.05)
legend("topleft", fill = rainbow(nPop(myData)),
legend = popNames(myData), bty = "n",
title = "Population")
An alernative to neighbour joining is to reduce the dimensionality of the problem so that it can be ploted in 2 dimensions instead of one dimension per locus. Many technics exists but the most common one is the so-called Principal Component Analysis, or PCA for short.
This is how you run a PCA:
myData_matrix <- scaleGen(myData, center = FALSE, scale = FALSE, NA.method = "mean")
mypca <- dudi.pca(myData_matrix, center = TRUE, scale = FALSE, scannf = FALSE, nf = Inf)
The PCA creates new dimensions…
head(mypca$li[, 1:4]) ## only show head for first 4 axes
## Axis1 Axis2 Axis3 Axis4
## N7 3.768724 -1.0570158 0.3136979 0.1203990
## N141 3.103792 0.5957429 -1.0972396 0.3455816
## N142 2.802742 -1.8958561 0.7628661 0.6864441
## N143 3.420050 -0.6665903 0.4915670 -1.4854476
## N144 2.244425 -1.3172071 1.5874842 1.3820029
## N145 4.369164 1.9061879 0.0952608 -1.7985942
which are uncorrelated…
head(zapsmall(cor(mypca$li))[, 1:4]) ## only show head for first 4 axes
## Axis1 Axis2 Axis3 Axis4
## Axis1 1 0 0 0
## Axis2 0 1 0 0
## Axis3 0 0 1 0
## Axis4 0 0 0 1
## Axis5 0 0 0 0
## Axis6 0 0 0 0
and which capture a decreasing amount of variation of the original loci:
barplot(mypca$eig,
names.arg = colnames(mypca$li),
cex.names = 0.5,
col = heat.colors(length(mypca$eig)),
las = 2, ylab = "Inertia")
or in cumulated percentage:
barplot(cumsum(100*mypca$eig/sum(mypca$eig)),
names.arg = colnames(mypca$li),
cex.names = 0.5,
col = rev(heat.colors(length(mypca$eig))),
las = 2, log = "y",
ylab = "Cumulative proportion of variance explained")
There are many ways to plot a PCA, but here we are interested in projecting the individuals into the new loci space, so we use the function s.class():
s.class(mypca$li, fac = pop(myData),
col = rainbow(nPop(myData)), grid = FALSE, xax = 1, yax = 2, cpoint = 0)
s.label(mypca$li, add.plot = TRUE, boxes = FALSE, clabel = 0.5)
add.scatter.eig(mypca$eig[1:10], xax = 1, yax = 2, ratio = 0.15)
The PCA does not force the group to be different, it just shows the overall structure. In contrast, the DAPC is a method that allows to explore whether some combinations of alleles would allows the characterisation of distinct groups.
Here is an example:
myclusters <- find.clusters(myData, n.clust = 3, n.pca = Inf) ## better not fixing n.clust and selecting it interactivelly!!!
dapc1 <- dapc(myData, pop = myclusters$grp, n.pca = Inf, n.da = Inf)
scatter(dapc1)
So yes, it seems that it is possible to create groups… but it is not clear what these groups correspond to in this dataset:
s.class(dapc1$ind.coord, fac = pop(myData),
col = rainbow(nPop(myData)), grid = FALSE, xax = 1, yax = 2, cpoint = 1)
compoplot(dapc1, show.lab = TRUE, legend = FALSE, cex.names = 0.4,
lab = paste(pop(myData), rownames(dapc1$tab)))